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Abstract 


Accurate  classification  of  samples  using  gene  expression  profiles  is  critically  dependent  on  the 
method  used  to  select  relevant  genes.  We  present  the  Bayesian  Model  Averaging  (BMA) 
method  for  gene  selection  and  classification  of  microarray  data.  Typical  gene  selection  and 
classification  procedures  ignore  model  uncertainty  and  use  a  single  set  of  relevant  genes 
(model)  to  predict  the  class.  BMA  accounts  for  the  uncertainty  about  the  best  set  to  choose  by 
averaging  over  multiple  models  (sets  of  potentially  overlapping  relevant  genes). 

We  showed  that  BMA  selects  smaller  numbers  of  relevant  genes  (compared  to  other  methods) 
and  achieves  high  prediction  accuracy  on  three  microarray  datasets.  Our  BMA  algorithm  is 
applicable  to  microarray  datasets  with  any  number  of  classes,  and  outputs  posterior  probabilities 
for  the  selected  genes  and  models.  Our  selected  models  typically  consist  of  only  a  few  genes. 
The  combination  of  high  accuracy,  small  numbers  of  genes  and  posterior  probabilities  for  the 
predictions,  should  make  BMA  a  powerful  tool  for  developing  diagnostics  from  expression 
data. 
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1  INTRODUCTION 

There  has  been  a  recent  explosion  in  the  use  of  microarray  data  for  classification  in  a  variety  of 
diagnostic  areas.  The  prediction  of  the  diagnostic  category  of  a  tissue  sample  from  its  expression 
array  phenotype  given  the  availability  of  similar  data  from  tissues  in  identified  categories  is  known  as 
classification  (tor  supervised  learning).  In  the  context  of  gene  expression  data,  the  samples  are  usually 
the  experiments,  and  the  classes  are  usually  different  types  of  tissue  samples,  for  example,  cancer 
vs.  non-cancer  (Alon  et  al.  1999;  Schummer  et  al.  1999),  different  tumor  types  (Golub  et  al.  1999; 
Alizadeh  et  al.  2000;  Ross  et  al.  2000;  Bhattacharjee  et  al.  2001;  Ramaswamy  et  al.  2001),  response 
to  therapy  (Shipp  et  al.  2002;  van 't  Veer  et  al.  2002;  Nutt  et  al.  2003).  A  challenge  in  predicting  the 
diagnostic  categories  using  microarray  data  is  that  the  number  of  genes  is  usually  much  greater  than 
the  number  of  tissue  samples  available,  and  only  a  subset  of  the  genes  is  relevant  in  distinguishing 
different  classes.  Selection  of  relevant  genes  for  classification  is  known  as  feature  selection.  A  small 
set  of  relevant  genes  is  essential  for  the  development  of  inexpensive  diagnostic  tests. 

Multi-class  classification  in  which  the  data  consist  of  more  than  two  classes  is  rapidly  gaining  attention 
in  the  literature.  For  example,  Ramaswamy  et  al.  (2001 )  combined  support  vector  machines,  which 
are  binary  classifiers,  to  solve  the  multi-class  classification  problem.  Nguyen  and  Rocke  (2002a; 
2002b)  used  partial  least  squares  (PLS)  for  feature  selection,  together  with  traditional  classification 
algorithms  such  as  logistic  discrimination  and  quadratic  discrimination  to  classify  multiple  tumor  types 
on  microarray  data.  Tibshirani  et  al.  (2002)  developed  an  integrated  feature  selection  and 
classification  algorithm  called  shrunken  centroid  for  classifying  multiple  cancer  types  in  which  features 
are  selected  by  considering  one  gene  at  a  time.  Yeung  and  Bumgarner  (2003)  extended  the  shrunken 
centroid  algorithm  to  take  dependency  between  genes  and  repeated  measurements  into 
consideration.  Dudoit  et  al.  (2002)  compared  the  performance  of  different  discrimination  methods, 
including  nearest  neighbor  classifiers,  linear  discriminant  analysis,  and  classification  trees,  for 
classifying  multiple  tumor  types  using  gene  expression  data. 

The  method  used  for  selecting  relevant  genes  is  critical  to  the  performance  of  all  classification 
algorithms.  Most  feature  selection  methods  in  the  literature  are  tailored  towards  binary  classification, 
and  are  univariate  in  the  sense  that  each  candidate  relevant  gene  is  considered  individually. 

Examples  of  univariate  methods  include  the  signal-to-noise  ratio  (Golub  et  al.  1999),  the  t-test 
(Nguyen  and  Rocke  2002b),  the  ratio  of  between-groups  to  within-groups  sum  of  squares  (BSS/WSS) 
(Dudoit  et  al.  2002),  the  Significance  Analysis  of  Microarray  (SAM)  statistic  (Tusher  et  al.  2001),  the 
Threshold  Number  of  Misclassification  (TNOM)  score  (Ben-Dor  et  al.  2000),  the  Wilcoxon  test  statistic 
(Dettling  2004)  and  many  others.  Multivariate  gene  selection  methods  consider  multiple  genes 
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simultaneously,  and  hence,  account  for  dependency  between  genes,  which  hopefully  will  lead  to  a 
reduced  number  of  relevant  genes.  Bo  and  Jonassen  (2002)  evaluated  relevant  genes  in  a  pairwise 
fashion,  while  Jaeger  et  al.  (2003)  and  Yeung  and  Bumgarner  (2003)  reduced  the  number  of  relevant 
genes  by  eliminating  highly  correlated  ones.  Recently,  Lee  et  al.  (2003)  employed  a  hierarchical 
Bayesian  model  which  used  a  Markov  Chain  Monte  Carlo  (MCMC)  based  stochastic  search  algorithm 
to  discover  relevant  genes.  Their  multivariate  gene  selection  algorithm  is  applicable  to  microarray  data 
with  two  classes  only.  Sha  et  al.  (2004)  extended  the  underlying  theory  to  multiple  classes  data  as 
well,  but  did  not  give  empirical  results  for  gene  selection  on  multi-class  microarray  data. 

In  addition,  most  proposed  feature  selection  and  classification  algorithms  ignore  model  uncertainty  by 
selecting  one  set  of  relevant  genes,  and  then  by  using  class  prediction  on  that  set  of  selected  genes. 

It  is  possible  that  there  is  more  than  one  set  of  relevant  genes  that  fit  the  model  equally  well, 
especially  with  microarray  data  in  which  the  number  of  genes  (variables)  is  much  greater  than  the 
number  of  samples.  There  have  been  efforts  to  use  model  averaging  and  model  ensemble 
approaches  to  classify  microarray  data.  As  an  example,  Li  and  Yang  (2002)  applied  a  model 
averaging  approach  to  classify  samples  by  averaging  over  multiple  single-gene  models  to  microarray 
data.  Boosting  algorithms  have  also  been  applied  to  microarray  data  (Ben-Dor  et  al.  2000;  Dudoit  et 
al.  2002;  Dettling  and  Buhlmann  2003;  Dettling  2004). 

In  this  paper,  we  present  the  Bayesian  Model  Averaging  (BMA)  approach  (Raftery  1995;  Hoeting  et  al. 
1999;  Viallefont  et  al.  2001)  as  our  multivariate  feature  selection  method  for  multi-class  microarray 
data.  This  is  in  contrast  to  Li  and  Yang  (2002)  in  which  the  emphasis  was  on  classification  and  genes 
were  selected  independently.  Our  approach  also  differs  from  Lee  et  al.  (2003)  and  Sha  et  al.  (2004)  in 
the  sense  that  we  adopt  a  model  averaging  approach  and  we  report  empirical  results  on  multi-class 
as  well  as  binary  microarray  data.  In  addition,  our  algorithms  are  computationally  efficient  compared 
to  the  MCMC  based  algorithms  in  Lee  et  al.  (2003)  and  Sha  et  al.  (2004).  We  extended  an  existing 
BMA  algorithm  to  be  applicable  to  any  number  of  input  variables  (genes),  and  to  any  number  of 
classes.  We  show  that  our  extended  BMA  algorithm  generally  selects  fewer  relevant  genes  and 
produces  prediction  accuracy  at  least  comparable  to  that  of  the  best  existing  feature  selection  and 
classification  methods.  We  also  propose  to  use  the  Brier  Score  (Brier  1950)  and  use  a  generalized 
Brier  Score  to  assess  prediction  accuracy  for  2-class  and  multi-class  datasets  respectively.  Our 
approach  has  the  additional  advantage  of  facilitating  biological  interpretation  by  producing  posterior 
probabilities  of  selected  genes  and  models.  Our  BMA  algorithm  is  a  multivariate  gene  selection 
method,  and  our  selected  models  are  typically  very  simple,  consisting  of  only  a  few  genes.  By 
averaging  over  multiple  simple  models  and  using  relatively  small  numbers  of  relevant  genes,  we 
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demonstrate  high  prediction  accuracy  on  both  binary  and  multi-class  microarray  data.  In  Section  2,  we 
review  Bayesian  Model  Averaging  (BMA)  and  describe  our  extension  of  existing  BMA  algorithms  to 
large  numbers  of  predictors  and  multi-class  classification  problems,  and  in  Section  3,  we  give  results 
for  three  gene  expression  datasets. 


2  METHODS 

2.1  Bayesian  Model  Averaging  (BMA) 

Typical  model  selection  approaches  select  a  model  and  then  proceed  as  if  the  selected  model  has 
generated  the  data,  which  might  lead  to  over-confident  inferences.  Bayesian  Model  Averaging  (BMA) 
takes  model  uncertainty  into  consideration  by  averaging  over  the  posterior  distributions  of  multiple 
models,  weighted  by  their  posterior  model  probability  (Raftery  1995;  Hoeting  et  al.  1999). 

For  simplicity,  let  us  first  consider  the  binary  classification  problem.  Let  Y  be  the  response  variable 
(class)  of  a  sample  in  the  test  set,  where  Y  =  0  or  1 ,  and  let  D  be  the  training  dataset  for  which  the 
classes  are  known.  The  essence  of  BMA  is  shown  in  Equation  1 :  the  posterior  probability  of  Y=1  given 
the  training  set  D  is  the  weighted  average  of  the  posterior  probability  of  Y=1  given  the  training  set  D 
and  model  Mk  multiplied  by  the  posterior  probability  of  model  Mk  given  training  set  D,  summing  over  a 
set  of  models  Mk  in  M: 

Pr(7  =  1 1 D)  =  J>r(7  =  1 1  D,Mk)  *  Pr (Mk  D).  (1 ) 

ke  M 

BMA  presents  several  implementation  difficulties.  One  of  these  is  that  the  exhaustive  summation  of  all 
feasible  models  leads  to  an  enormous  number  of  terms  in  Equation  1.  Raftery  (1995)  used  the  leaps 
and  bounds  algorithm  (Furnival  and  Wilson  1974)  to  efficiently  identify  a  reduced  set  of  good  models. 
The  leaps  and  bounds  algorithm  rapidly  returns  the  best  “nbest”  models  of  each  size  (up  to  30 
variables).  Madigan  and  Raftery  (1994)  proposed  to  use  the  Occam’s  window  method  to  choose  a  set 
of  parsimonious  and  data-supported  models.  Their  idea  is  to  discard  models  that  are  much  less  likely 
than  the  best  model  supported  by  the  data  (the  default  is  20  times  less  likely). 

A  second  difficulty  with  BMA  is  that  there  is  an  implicit  integral  associated  with  the  evaluation  of  the 
posterior  probability  for  model  Mk  given  training  set  D.  Using  Bayes’s  Theorem,  Pr(Mk|D)  is 
proportional  to  Pr(D|Mk)*Pr(Mk),  where  Pr(D|Mk)  is  the  integrated  likelihood  of  model  Mk  in  which  the 
regression  parameters  for  model  Mk  are  integrated  over.  Please  refer  to  Raftery  (1995)  and  Hoeting  et 
al.  (1999)  for  the  mathematical  details.  There  are  many  different  ways  to  approximate  this  integral 
including  MCMC  approximations  (Madigan  and  York  1995).  In  this  paper,  we  use  logistic  regression 
(Hosmerand  Lemeshow  2000)  to  predict  Pr(Y=1|D,  Mk)  such  that  ln[Pr(Y=1|D,  Mk)/  Pr(Y=0|D,  Mk)]  = 
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b0+b1x1+...+bpXp,  where  Xj’s  represent  the  expression  levels  of  selected  genes  and  b,’s  are  the 
regression  parameters.  In  this  case,  the  Bayesian  Information  Criterion  (BIC)  can  be  used  to 
approximate  the  integral  (Raftery  1995).  We  adopt  the  BMA  implementation  (Raftery  1995)  which 
makes  use  of  the  BIC  approximation.  The  source  code  of  the  BMA  implementation  is  available  at 
http://www.research.att.com/~volinsky/bma.html. 

P-values  computed  from  typical  variable  selection  procedures  (such  as  stepwise  forward  or  backward 
selection)  have  been  shown  to  overstate  the  strength  of  inference  (Raftery  1995;  Viallefont  et  at. 
2001).  One  of  the  advantages  of  BMA  is  that  it  yields  an  easily  interpreted  summary:  posterior 
probabilities  for  the  selected  models  and  the  selected  genes  (variables).  In  particular,  our  adopted 
traditional  BMA  implementation  (Raftery  1995)  outputs  the  posterior  probability  that  each  variable  is 
non-zero,  probneO. 


2.2  Our  modifications  to  existing  BMA  algorithms 
Iterative  BMA  algorithm 

With  microarray  data,  the  number  of  genes  (variables)  is  typically  much  greater  than  the  number  of 
samples  (responses).  However,  in  the  traditional  BMA  implementation  (Raftery  1995),  the  leaps  and 
bounds  algorithm  can  only  compute  the  best  “nbest”  models  for  up  to  30  variables,  and  if  the  number 
of  variables  is  greater  than  30,  backward  elimination  is  used  to  reduce  the  number  of  variables  to  30 
before  applying  the  leaps  and  bounds  algorithm.  However,  stepwise  backward  elimination  in  which 
one  variable  is  removed  at  a  time  cannot  be  applied  in  this  situation  in  which  there  are  more 
predictors  (genes)  than  observations  (samples).  Instead,  we  developed  an  iterative  BMA  algorithm 
which  first  rank  orders  genes  with  a  univariate  gene  selection  method  and  then  moves  a  30-variable 
window  down  the  ordered  genes.  Recall  that  probneO  represents  the  posterior  probability  of  a  gene 
not  being  equal  to  zero,  and  hence,  genes  with  high  probneO  are  good  candidates  for  relevant  genes. 

Outline  of  Iterative  BMA  Algorithm 

Input:  training  set  D  with  G  genes  and  n  samples 

Pre-processing  step:  Rank  all  G  genes  using  a  univariate  gene  selection  procedure.  Let  x^  x2 . xG 

be  the  ordered  list  of  genes. 

Parameters:  nbest  and  p,  where  p  is  the  total  number  of  genes  to  be  processed  such  that  30<  p  =  G. 
1 .  Initially,  start  with  the  30  top  ranked  genes  (x-i,  x2,  ...,  x30),  and  apply  the  traditional  BMA 
algorithm.  Let  toBeProcessed  be  an  ordered  list  of  genes  with  ranks  31  to  p.  Initially, 
toBeProcessed  <-  x3i,  x32 . xp. 
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2.  Repeat  until  all  p  genes  are  processed 


a.  Remove  all  genes  with  probneO  <  1  %. 

b.  Adaptive  threshold  step:  If  all  genes  have  probneO  =  1%,  determine  the  minimum 
probneO,  minProbneO,  among  the  30  genes  in  the  current  window.  Remove  all  genes 
with  probneO  <  ( minProbneO  +1)%. 

c.  Let  removedGenes  be  the  set  of  genes  removed,  and  suppose  q  genes  are  removed. 

d.  Replace  the  q  removed  genes  with  the  next  q  genes  from  toBeProcessed.  Update 
toBeProcessed  <-  toBeProcessed  -  removedGenes. 

e.  Apply  the  traditional  BMA  algorithm. 

Output:  selected  models  and  their  posterior  probabilities,  selected  genes  and  their  corresponding 
probneO,  maximum  likelihood  estimates  of  the  regression  parameters  in  each  model 


In  our  study,  we  used  the  ratio  of  between-group  to  within-group  sum  of  squares  (BSS/WSS)  (Dudoit 
et  al.  2002)  to  determine  the  initial  gene  order.  Intuitively,  genes  with  relatively  large  variation  between 
classes  and  relatively  small  variation  within  classes  are  likely  candidates  as  relevant  genes. 

BSS/WSS  is  a  univariate  gene  selection  method  in  which  genes  with  large  BSS/WSS  ratios  are  good 
candidate  relevant  genes.  For  a  gene  j,  let  Djj  denote  the  expression  level  of  gene  j  under  sample  i, 

Dkj  denote  the  average  expression  level  of  gene  j  over  samples  in  class  k,  and  D.j  denote  the 
average  expression  level  of  gene  j  over  all  samples.  The  BSS/WSS  ratio  for  gene  j  is  defined  as 


EE/(^)(d,-£,)2 

mn=rr_ 

WSS(J)  ~  EE^  =  k){Dj-D^  ’ 


(2) 


where  l(Yi=k)  is  equal  to  1  if  sample  i  belongs  to  class  k  and  is  equal  to  0  otherwise.  In  step  1  of  the 
iterative  BMA  algorithm,  we  compute  the  BSS/WSS  ratio  for  each  of  the  G  genes  and  order  the  genes 
in  descending  order  of  the  BSS/WSS  ratio. 


Multi-class  Iterative  BMA 

For  multi-class  microarray  data,  we  developed  an  individualized  regression  approach  in  which  binary 
logistic  regressions  are  combined.  We  used  the  approximation  of  Begg  and  Gray  (1984)  (also 
discussed  in  Chapter  8  of  Hosmer  and  Lemeshow  (2000)).  They  studied  the  use  of  a  series  of 
individualized  binary  logistic  regressions  as  an  approximation  for  polychotomous  logistic  regression  in 
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which  the  response  variable  can  take  more  than  two  values.  They  showed  that  this  provides  a  close 
approximation  to  maximum  likelihood  estimation  of  the  full  multinomial  logistic  regression  model.  For 
our  purposes,  it  is  particularly  attractive  because  it  allows  us  to  use  the  well-established  and 
computationally  efficient  algorithms  for  BMA  in  binary  logistic  regression  when  building  BMA  for  multi¬ 
class  classification. 

Suppose  there  are  K  classes  such  that  the  response  variable  (class)  Y  takes  on  values  0,  1 . or  (K- 

1),  where  K  =  3,  and  let  Y  be  the  response  variable  for  sample  i.  Our  idea  is  to  use  a  separate  binary 
logistic  regression  to  discover  relevant  genes  for  each  training  subset  (Y=0  vs.  Y=k),  where  k  =  1 , 
(K-1),  and  use  the  Begg  and  Gray  (1984)  approach  to  create  an  augmented  matrix  M  to  approximate 
polychotomous  logistic  regression  using  the  selected  genes  from  each  training  subset  with  binary 
logistic  regression.  Figure  1  shows  a  flowchart  of  our  algorithm  with  an  example  augmented  matrix  M 
for  K=3.  The  augmented  matrix  M  is  formed  by  concatenating  the  selected  genes  from  each  training 
subset  and  pasting  the  two  training  subsets  (Y=0  vs.  Y=1)  and  (Y=0  vs.  Y=2)  together.  There  is  a 
column  in  M  for  the  regression  parameter  of  each  gene.  The  first  n-i  rows  of  M  correspond  to  samples 
with  Y=0  or  Y=1  and  the  next  n2  rows  of  M  correspond  to  samples  with  Y=0  or  Y=2.  Finally,  we  order 
the  columns  in  M  using  BSS/WSS  ratios  and  apply  the  iterative  BMA  algorithm  to  M  to  discover 
relevant  genes. 

Outline  of  multi-class  iterative  BMA 

1 .  Using  Y=0  as  our  baseline,  create  subsets  of  the  samples  from  the  training  set  for  the  binary 
classification  problem  in  which  Y=0  or  Y=k,  where  k  =  1, 2,  ..,  (K-1),  and  ignore  all  the  data 
from  Y?  0  and  Y?  k.  Denote  the  number  of  training  samples  for  Y=0  vs.  Y=k  by  nk.  In  the 
training  subset  (Y=0  vs.  Y=k),  the  response  variable  Y*=0  when  Y=0,  and  Y*=1  when  Y=k. 

2.  For  each  training  sample  subset  (Y=0  vs.  Y=k)  where  k=1 ,2,  ...,  (K-1),  apply  the  iterative  BMA 
algorithm,  and  let  Sk  be  the  set  of  selected  genes  from  this  subset. 

3.  Merge  the  selected  genes  from  each  training  sample  subset  to  create  an  augmented  design 

K  K-1 

matrix  with  ordered  columns,  M,  which  has  ^nk  rows  and  (K-2  +  ^|  Sk  |)  columns 

k= 1  k= 1 

(variables). 

a.  Compute  BSS/WSS  ratios  for  each  gene  in  Sk  from  each  training  sample  subset  k. 

b.  Sort  the  BSS/WSS  ratios  from  all  (K-1)  training  sample  subsets  Sk. 
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c.  The  first  (K-2)  columns  of  the  design  matrix  M  represent  the  “intercept”  columns  while 
all  other  columns  represent  genes  (variables).  The  first  n-i  rows  of  M  represent  the 
training  sample  subset  Y=0  or  Y=1 ,  and  the  next  n2  rows  of  M  represent  Y=0  or  Y=2 
etc. 

d.  For  k=  2  to  (K-1 ),  M[i,  k-1]=1  for  any  sample  i  in  training  subset  k  in  which  Yj=0  or  Yj=k, 
and  M[i,  k-1]  =0  for  all  other  samples. 

e.  For  k=1  to  (K-1 )  and  each  gene  g  in  Sk,  M[i,  (K-2)+r]=Dig  where  r  is  the  rank  of  gene  g 
from  step  (3b)  and  Dig  is  the  expression  level  of  gene  g  under  sample  i  in  the  training 
set  D  for  any  sample  i  in  training  subset  k  (YpO  vs.  Ypk),  and  M[i,  (K-2)+r]=0 
otherwise. 

f.  The  response  variable  for  M,  YM=0  for  Y=0,  and  YM=1  for  Y=k  where  k=1 ,  2,..,  (K-1 ). 

4.  Apply  the  iterative  BMA  algorithm  to  the  augmented  data  matrix  M. 

5.  Prediction  step:  use  the  regression  parameters  from  the  selected  variables  from  Step  4. 


2.3  Evaluation  of  predictive  performance 

In  the  literature,  the  number  of  classification  errors  is  the  most  popular  measure  of  predictive 
performance,  for  example,  (Golub  et  al.  1999;  Nguyen  and  Rocke  2002a;  van 't  Veer  et  al.  2002;  Lee 
et  al.  2003).  However,  in  our  case,  the  predicted  probability  for  each  class,  Pr(Y=k|D),  is  available.  For 
example,  a  predicted  probability  close  to  0  or  1  is  more  desirable  than  a  predicted  probability  around 
0.5  in  the  binary  classification  case.  In  order  to  take  the  magnitudes  of  predicted  probabilities  into 
consideration,  we  adopted  the  Brier  Score  (Brier  1950)  as  our  evaluation  measure.  For  binary  data,  let 
Yi  denote  the  response  variable  (class)  of  sample  i,  where  Y,  =  0  or  1 .  Denote  the  predicted  probability 

n  2 

that  sample  i  belongs  to  class  1,  Pr(Y,=1|D),  by  p,.  The  Brier  Score  is  defined  as  ^(Y)  -p,)  ,  which  is 

/= i 

the  sum  of  squares  of  the  difference  between  the  true  class  and  the  predicted  probability  over  all 
samples.  If  the  predicted  probabilities,  pi;  are  constrained  to  equal  to  0  or  1 ,  the  Brier  Score  is  equal  to 
the  total  number  of  classification  errors.  Thus  the  Brier  Score  allows  us  to  compare  the  performance 
of  the  deterministic  0-1  classification  methods  with  that  of  probabilistic  methods  such  as  BMA,  which 
is  an  appealing  feature. 

We  use  the  generalized  Brier  Score  for  the  multi-class  case,  where  Y,  =  0,  1,  ...,  (K-1).  Let  Yik  be  an 
indicator  variable  such  that  Yik=1  if  Y,=k  and  Yik=0  otherwise,  where  k  =  0,  1 . (K-1 ).  Let  pik  denote 


-  10- 


,|  n  K- 1 

the  predicted  probability  that  Yj  =k.  The  generalized  Brier  Score  is  defined  as  5EE<y»  -Pikf  ■  It 

2  /= 1  fc=o 

can  be  shown  that  the  generalized  Brier  Score  is  reduced  to  the  Brier  Score  when  K=2.  A  high 
generalized  Brier  Score  indicates  poor  predictive  performance. 


3  RESULTS 

3.1  Breast  cancer  prognosis  data  (2-class) 

The  breast  cancer  prognosis  dataset  (van 't  Veer  et  al.  2002)  consists  of  primary  breast  tumor 
samples  hybridized  to  cDNA  arrays  consisting  of  24481  genes  with  78  samples  in  the  training  set,  and 
19  samples  in  the  test  set.  These  samples  are  divided  into  two  categories:  the  good  prognosis  group 
(patients  who  remained  disease  free  for  at  least  5  years)  and  the  poor  prognosis  group  (patients  who 
developed  distant  metastases  within  5  years).  We  identified  4919  significantly  regulated  genes  (at 
least  a  two-fold  difference  and  p-value  <  0.01  in  least  3  samples)  from  the  training  set.  We  further 
deleted  two  samples  with  missing  values  from  the  training  set.  Therefore,  the  breast  cancer  prognosis 
training  set  used  in  our  experiments  consists  of  76  samples  and  the  test  set  consists  of  19  samples 
(see  Table  1)  across  4919  genes. 

We  applied  the  iterative  BMA  algorithm  for  binary  classification  to  the  breast  cancer  prognosis  data, 
and  achieved  a  comparable  number  of  classification  errors  on  the  test  set  to  the  reported  results  in 
van’t  Veer  et  al.  (2002)  while  using  significantly  fewer  relevant  genes.  We  experimented  with  various 
control  parameters  for  the  iterative  BMA  algorithm  in  our  study,  including  the  number  of  models 
returned  by  the  leaps  and  bounds  algorithm  for  up  to  30  variables  (nbest)  and  the  number  of  top 
genes  ranked  by  BSS/WSS  ratios  (p).  We  observed  that  a  large  p  (1000  or  more  genes)  typically 
yields  lower  Brier  Scores  and  classification  errors,  and  with  the  exception  of  nbest=10,  which  is  too 
small,  the  prediction  accuracy  and  the  number  of  selected  genes  are  relatively  insensitive  to  “nbest”. 

Using  all  4919  genes  and  nbest=20,  our  iterative  BMA  algorithm  produced  3  classification  errors  on 
the  test  set  (out  of  19  samples)  and  a  Brier  Score  of  2.04  using  6  selected  genes,  van’t  Veer  et  al. 
(2002)  reported  2  classification  errors  on  the  test  set  using  70  relevant  genes.  There  is  only  one 
common  gene  between  our  6  selected  genes  and  the  70  relevant  genes  from  van’t  Veer  et  al.  (2002). 
This  is  probably  due  to  the  fact  that  4  out  of  our  6  selected  genes  have  poor  univariate  rankings 
(above  200,  see  Table  2).  In  addition,  the  70  relevant  genes  from  van’t  Veer  et  al.  (2002)  are  chosen 
due  to  a  high  correlation  (>  0.3  or  <-0.3)  with  the  response  variable.  Some  of  these  high  correlation 
genes  may  be  correlated  among  themselves.  For  example,  among  the  top  10  correlated  genes  (with 
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the  response  variable)  from  the  70-gene  subset,  four  of  them  have  correlation  greater  than  0.3  with 
the  top  ranking  gene  AL080059. 

Our  results  demonstrate  the  power  of  our  multivariate  BMA  gene  selection  procedure  that  explores  all 
p  given  genes:  genes  with  poor  univariate  rankings  may  be  beneficial  in  classification  when  used  in 
combination  with  other  genes.  By  choosing  our  relevant  genes  from  sets  of  genes,  the  iterative  BMA 
algorithm  greatly  reduces  the  number  of  relevant  genes  needed  for  accurate  class  prediction. 
Furthermore,  these  6  selected  genes  are  used  in  13  selected  models,  each  of  which  consists  of  3  to  6 
genes.  The  predicted  probabilities  for  the  19  test  samples  are  illustrated  in  the  uncertainty  plot  in 
Figure  2,  in  which  the  uncertainty  (1  -  Pr(Y=1|D))  is  plotted  against  the  test  samples,  sorted  by 
increasing  uncertainty  (Bensmail  et  al.  1 997).  Figure  2  shows  that  two  out  of  the  three  misclassified 
test  samples  have  high  uncertainty,  indicating  that  our  assessment  of  uncertainty  does  correspond 
with  the  errors  actually  made,  as  we  would  wish. 


3.2  Leukemia  data  (2  and  3  classes) 

The  leukaemia  dataset  (Golub  et  al.  1999)  consists  of  7129  genes,  38  samples  in  the  training  set  and 
34  samples  in  the  test  set.  We  filtered  out  genes  that  do  not  exhibit  significant  variation  across  the 
training  samples,  leaving  3051  genes,  and  then  performed  thresholding  and  the  logarithmic 
transformation.  The  data  consist  of  samples  from  patients  with  either  Acute  Lymphoblastic  Leukemia 
(ALL)  or  Acute  Myeloid  Leukemia  (AML).  However,  it  has  also  been  noted  that  the  global  expression 
profiles  also  reflect  two  ALL  sub-types  (B-cell  and  T-cell)  (Golub  et  al.  1999).  Hence,  this  dataset  can 
be  divided  into  either  2  or  3  classes  (see  Tables  3a  and  3b). 

We  first  applied  the  iterative  BMA  algorithm  to  the  2-class  leukemia  data,  and  achieved  a  comparable 
number  of  classification  errors  on  the  test  set  to  other  reported  results  in  the  literature.  Specifically,  we 
observed  a  Brier  Score  of  1 .5,  with  2  classification  errors  on  the  test  set  (out  of  34  samples)  with  20 
selected  genes,  using  nbest=20  and  p=1000  top  ranked  genes1.  Similarly  to  what  happened  with  the 
breast  cancer  prognosis  data,  13  (out  of  20)  selected  genes  have  poor  univariate  BSS/WSS  rankings 
(above  200).  This  dataset  is  widely  used  in  classification  and  feature  selection  papers  in  the  literature. 
For  example,  Nguyen  and  Rocke  (2002b)  reported  1  to  3  classification  errors  on  the  test  set  using  50 


Using  all  3051  genes  yielded  unstable  models.  We  observed  this  unstable  model  phenomenon  on  this 
thresholded  dataset  (in  which  expression  values  are  thresholded  by  100  and  16000  before  applying  the 
logarithmic  transformation)  only,  but  not  on  other  unthresholded  datasets.  This  is  probably  because  some  genes 
with  low  BSS/WSS  rankings  have  many  identical  thresholded  values  across  the  samples  leading  to  singular 
matrices  in  our  computation. 
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to  1500  selected  genes.  They  also  noted  that  test  sample  #66  is  consistently  misclassified  in  the 
microarray  community  and  suggested  that  the  sample  might  be  incorrectly  labelled.  Sample  #66  is 
one  of  the  two  misclassified  samples  in  our  results.  Our  iterative  BMA  algorithm  consistently 
misclassified  sample  #66  in  all  of  our  experiments  using  different  parameter  values  (nbest  and  p).  Lee 
et  al.  (2003)  reported  one  of  the  most  favorable  results  in  the  literature,  which  is  1  classification  error 
using  5  genes.  However,  it  is  not  clear  whether  sample  #66  was  misclassified  in  their  reported  results. 

Next,  we  applied  our  multi-class  iterative  BMA  algorithm  to  the  3-class  leukemia  data  (AML,  ALL-B 
cell,  ALL-T  cell).  This  produced  very  encouraging  results:  a  Brier  Score  of  1 .5  with  1  classification 
error  on  the  test  set  (34  samples),  using  15  genes  (nbest  =  20,  p  =  1000).  Figure  3  shows  the 
uncertainty  plot  and  Table  4  shows  the  selected  genes  and  their  corresponding  posterior  probabilities. 
It  is  interesting  that  we  achieve  a  similar  Brier  Score  in  the  3-class  case  as  in  the  2-class  case.  Six  out 
of  the  15  relevant  genes  were  selected  from  the  binary  classification  problem  comparing  AML  to  ALL- 
B  cell  (Y=0  vs.  Y=1 ),  and  nine  genes  were  selected  from  comparing  AML  to  ALL-T  cell  (Y=0  vs.  Y=2). 
Nguyen  and  Rocke  (2002a)  pooled  the  training  and  test  sets  (38+34=72  samples)  and  classified  each 
of  the  72  samples  in  turn  using  the  classifier  built  using  the  remaining  71  samples.  They  reported  4 
classification  errors  out  of  72  samples  with  69  to  100  genes,  using  leave-one-out  cross  validation.  We 
not  only  produced  a  lower  error  rate  (3%  compared  to  5.5%)  and  good  Brier  Scores,  but  also  our  BMA 
classifier  was  built  using  only  38  training  samples.  Recently,  Lee  and  Lee  (2003)  also  applied  the 
multicategory  support  vector  machine  to  the  training  set  (with  38  samples)  of  the  3-class  leukaemia 
data.  Their  best  result  is  1  classification  error  on  the  test  set  (with  34  samples)  using  40  relevant 
genes. 


3.3  Hereditary  breast  cancer  data  (3  classes) 

Hedenfalk  et  al.  (2001)  studied  the  expression  patterns  of  hereditary  breast  cancer  with  gene 
mutations  (BRCA1  or  BRCA2  mutations).  The  hereditary  breast  cancer  dataset  consists  of  7  samples 
of  cancers  with  BRCA1  mutation,  8  samples  with  BRCA2  mutation,  and  7  sporadic  cases  of  primary 
breast  cancers  over  3226  genes.  There  is  no  separate  test  set  available,  so  we  use  leave-one-out 
cross  validation  (LOOCV)  in  which  each  of  the  22  samples  is  used  in  turn  as  the  test  sample  and  a 
classifier  is  built  using  the  remaining  21  samples. 

We  applied  the  multi-class  iterative  BMA  algorithm  to  this  three-class  data,  and  obtained  encouraging 
results:  a  Brier  Score  of  5.5  with  6  classification  errors  (out  of  22  samples)  with  13  to  18  relevant 
genes,  using  all  3226  genes  and  nbest=50.  Since  LOOCV  is  used,  a  different  classifier  is  built  for 
each  test  sample,  so  the  number  of  relevant  genes  may  vary  in  each  classifier.  Nguyen  and  Rocke 
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(2002a)  reported  6  classification  errors  with  343  to  438  relevant  genes  using  their  proposed  partial 
least  squares  gene  selection  method  on  the  same  dataset. 


4  DISCUSSION 

We  have  proposed  iterative  BMA  algorithms  for  gene  selection  on  binary  and  multi-class  microarray 
data.  Both  are  multivariate  gene  selection  methods  in  which  dependency  between  genes  is  exploited. 
Our  algorithms  take  advantage  of  model  uncertainty  by  averaging  over  multiple  models  (sets  of 
relevant  genes).  We  demonstrated  high  prediction  accuracy  using  smaller  numbers  of  genes  (relative 
to  other  methods)  on  both  binary  and  multi-class  microarray  datasets.  Table  5  shows  an  overall 
summary  of  our  results.  In  addition,  our  algorithms  produce  posterior  probabilities  for  both  selected 
genes  and  models,  and  these  posterior  probabilities  aid  biological  interpretation.  We  also  observed 
that  the  selected  models  are  generally  very  simple,  containing  only  a  few  genes.  Furthermore,  we 
adopted  the  Brier  Score  and  used  the  generalized  Brier  Score  to  evaluate  prediction  accuracy,  taking 
the  posterior  probabilities  for  the  response  variables  into  consideration. 

Unlike  most  feature  selection  algorithms,  in  which  a  pre-specified  number  (usually  small)  of  top 
ranked  genes  are  chosen  as  relevant  genes  and  all  the  remaining  genes  are  discarded,  our  Iterative 
BMA  algorithm  guarantees  that  all  p  genes  are  considered  even  though  the  resulting  selected  genes 
and  models  depend  on  the  initial  ranking.  We  show  that  genes  with  poor  univariate  scores  may 
contribute  to  increased  prediction  accuracy,  and  we  recommend  using  all  available  genes  (i.e.,  p  =  G) 
in  the  iterative  BMA  algorithms,  except  in  the  case  of  thresholded  data.  From  our  experiments, 
nbest=20  or  50  generally  yield  good  results. 

In  order  to  efficiently  compute  a  reduced  set  of  good  models,  we  use  the  leaps  and  bounds  algorithm 
(Furnival  and  Wilson  1974),  which  returns  the  best  “nbest”  models  for  each  size  up  to  30  variables. 
This  imposes  a  restriction  of  a  30-variable  window  on  our  iterative  BMA  algorithms,  which  in  turn  limits 
our  algorithms  to  choosing  at  most  30  relevant  genes.  Although  this  restriction  does  not  seem  to  hurt 
performance,  we  are  currently  in  the  process  of  exporting  our  BMA  software  from  Splus  to  R,  and 
relaxing  this  30-variable  limitation.  Our  current  implementation  is  computationally  efficient.  For 
example,  it  takes  under  30  minutes  to  run  our  iterative  BMA  algorithm  on  the  binary  breast  cancer 
prognosis  dataset  (nbest=20  and  p  =  4919)  on  a  moderate  computer  with  a  1 ,4GHz  AMD  Athlon 
processor.  Another  future  project  is  to  study  the  effect  of  the  chosen  baseline  (Y=0)  in  our  multi-class 
iterative  BMA  algorithm.  Our  preliminary  results  show  that  changing  the  baseline  response  variable 
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does  not  affect  predictive  performance  much.  However,  the  number  of  relevant  genes  chosen  can  be 
different. 

The  combination  of  high  accuracy,  small  numbers  of  genes  and  posterior  probabilities  for  the 
predictions,  should  make  BMA  an  attractive  tool  for  developing  diagnostics  from  expression  data.  The 
posterior  probability  of  the  prediction  provides  an  estimate  of  the  certainty  of  the  classification,  which 
can  be  useful  in  a  diagnostic  setting. 
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7  TABLES 


Table  1  -  Prognosis  groups  and  class  sizes  of  the  training  set  and  test  set  of  the  breast  cancer 
prognosis  data.  Y  is  the  response  (class)  variable. 


Prognosis  group 

Y 

Training  set  (total  76) 

Test  set  (total  19) 

Poor  (develop  metastases  within  5  years) 

0 

33 

12 

Good  (disease  free  for  at  least  5  years) 

1 

43 

7 

Table  2:  Selected  genes  and  their  corresponding  posterior  probabilities  of  not  being  equal  to  zero 
(probneO),  BSS/WSS  ranks,  and  membership  in  the  70-gene  signature  chosen  by  van't  Veer  et  al. 
(2002)  for  the  breast  cancer  prognosis  data  using  4919  genes  and  nbest=20.  The  genes  are  shown  in 
descending  order  of  probneO. 


selected  genes 

probne0(%) 

BSS/WSS  rank 

in  70-gene  signature? 

gene  description 

AL080059 

100.0 

1 

yes 

Homo  sapiens  mRNA;  cDNA 

DKFZp564H142  (from  clone 

DKFZp564H142) 

Contig49670_RC 

80.8 

95 

no 

Homo  sapiens  cDNA:  FLJ23228  fis,  clone 
CAE06654 

NM_012214 

70.8 

201 

no 

mannosyl  (alpha-1, 3-)-glycoprotein  beta-1, 4- 
N-acetylglucosaminyltransferase,  isoenzyme 
A 

Contig59951 

57.3 

793 

no 

RAD21  (S.  pombe)  homolog 

Contig46443_RC 

57.3 

1349 

no 

ESTs,  Weakly  similar  to  AF279265  1 
putative  anion  transporter  1  [H. sapiens] 

wmm^m 

41.4 

423 

no 

tetratricopeptide  repeat  domain  2 

Table  3:  Groups  and  class  sizes  of  the  training  and  test  sets  of  the  leukemia  data.  Y  is  the  response 
(class)  variable, 
a.  2-class  (ALL  vs.  AML) 


Class 

Y 

Training  set  (total  38) 

Test  set  (total  34) 

ALL  (Acute  Lymphoblastic  Leukemia) 

0 

27 

20 

AML  (Acute  Myeloid  Leukemia) 

1 

11 

14 

b.  3-class  (AML  vs.  ALL-B  cell  vs.  ALL-T  cell) 


Class 

Y 

Training  set  (total  38) 

Test  set  (total  34) 

AML 

0 

11 

14 

ALL-B  cell 

1 

19 

19 

ALL-T  cell 

2 

8 

1 

Table  4:  Selected  genes  and  their  corresponding  posterior  probabilities  of  not  being  equal  to  zero 
(probneO),  and  BSS/WSS  ranks  for  the  3-class  leukemia  data  using  p=1000  genes  and  nbest=20.  The 
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BSS/WSS  ranks  represent  the  ranks  in  the  binary  logistic  regression  (Y=0  vs.  Y=1)  or  (Y=0  vs.  Y=2). 
If  a  gene  is  selected  in  only  one  binary  logistic  regression,  a  blank  entry  is  shown.  For  example, 
X03934_at  was  ranked  #1  in  the  binary  regression  between  AML  (Y=0)  and  ALL-T  cell  (Y=2),  but 
X03934_at  was  not  selected  in  the  binary  regression  between  AML  (Y=0)  and  ALL-B  cell  (Y=1).  The 
genes  are  shown  in  descending  order  of  probneO. 


selected  genes 

probneO  (%) 

BSS/WSS  rank 

Y=0  vs.  1  Y=0  vs.  2 

gene  description 

M27891_at 

100.0 

1 

CST3  Cystatin  C  (amyloid  angiopathy  and  cerebral  hemorrhage) 

32.6 

279 

MANA2  Alpha  mannosidase  II  isozyme 

30.9 

1 

GB  DEF  =  T-cell  antigen  receptor  gene  T3-delta 

linsmi 

30.9 

2 

TCF7  Transcription  factor  7  (T-cell  specific) 

DHH 

18.7 

152 

54  kDa  protein  mRNA 

8.1 

213 

OX-2  MEMBRANE  GLYCOPROTEIN  PRECURSOR 

iSH 

8.1 

312 

Kinectin  gene 

X74008_at 

8.0 

802 

PPP1CC  Protein  phosphatase  1,  catalytic  subunit,  gamma 
isoform 

IBB 

8.0 

112 

Butyrophilin  (BTF5)  mRNA 

SBUKt— i 

7.9 

354 

Ras  GTPase-activating-like  protein  (IQGAP1)  mRNA 

Bumi 

6.6 

974 

Sigma  3B  protein 

Ifi  ' 

5.7 

523 

Prostaglandin  D2  synthase  gene 

mm 

5.7 

931 

GB  DEF  =  Somatostatin  receptor  isoform  2  (SSTR2)  gene 

mm 

5.3 

972 

Extracellular  matrix  protein  collagen  type  XIV,  C-terminus 

IB9H 

5.1 

1000 

PROBABLE  G  PROTEIN-COUPLED  RECEPTOR  GPR3 

Table  5:  Summary  of  our  results.  The  number  of  relevant  genes,  Brier  Score  and  the  number  of 
classification  errors  on  the  test  set  obtained  from  our  iterative  BMA  algorithms  are  shown  in  column  4. 
The  number  of  relevant  genes  and  number  of  classification  errors  on  the  test  set  from  published 
results  are  shown  in  column  5.  *Results  from  the  hereditary  breast  cancer  data  were  evaluated  using 
leave-one-out  cross  validation  (LOOCV). 


Dataset 

#  classes 

Classes 

Results  from  our 
iterative  BMA 
algorithms 

Published  results 

Breast  cancer 
prognosis  data 

2 

Poor  vs.  Good 
prognosis  groups 

#  genes  =  6 

Brier  Score  =  2.04 

#  errors  =  3/19 

#  genes  =  70 

#  errors  =  2/19 

Leukemia  data 

3 

AML  vs.  ALL-B  cell 
vs.  ALL-T  cell 

#  genes  =  15 

Brier  Score  =  1 .5 

#  errors  =  1/34 

#  genes  =  40 

#  errors  =  1/34 

Hereditary  breast 
cancer  data* 

3 

Sporadic  vs. 

BRCA1  vs.  BRCA2 

#  genes  =  13  to  18 
Brier  Score  =  5.5 

#  errors  =  6/22 

#  genes  =  343  to  438 

#  errors  =  6/22 
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8  FIGURES 


Figure  1:  A  flowchart  illustrating  the  multi-class  iterative  BMA  algorithm  for  K=3.  Suppose  two  genes 
X-|  and  x2  are  selected  in  the  two  binary  logistic  regressions  (Y=0  vs.  Y=1  and  Y=0  vs.  Y=2)  from  the 
iterative  BMA  algorithm.  The  goal  of  polychotomous  regression  is  to  estimate  the  regression 
parameters  for  g1(x)=ln[P(Y=1  |D)/P(Y=0|D)]=b10  +  bi^  +  b12x2  and  g2(x)=  ln[P(Y=2|D)/P(Y=0|D)]=b20 
+  b2ix-,  +  b22x2.  The  augmented  matrix  M  consists  of  an  intercept  column  (b20  -  b10)  and  a  column  for 
each  regression  parameter  bn,  b12,  b21,  and  b22. 
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Figure  2:  Uncertainty  plot  for  the  predicted  probabilities  on  the  test  set  (19  samples)  of  the  breast 
cancer  prognosis  data.  The  y-axis  represents  the  uncertainty  (1  -  predicted  probability  of  Y=1),  and 
the  x-axis  represents  the  19  test  samples  sorted  in  increasing  order  of  uncertainty.  The  follow-up  time 
of  patients  is  used  to  label  the  upper  x-axis.  The  vertical  bars  represent  classification  errors.  In  other 
words,  test  samples  #  102,  117,  109  with  follow-up  times  3.3,  5.3,  3.2  respectively  were  misclassified. 
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lotloM/up  time 
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Figure  3:  Uncertainty  plot  for  the  predicted  probabilities  on  the  test  set  (34  samples)  of  the  3-class 
leukemia  data.  Each  sample  is  classified  as  being  in  the  class  j  with  the  maximum  predicted 
probability  Pr(Y=j|D),  where  j=0,  1,  2.  The  y-axis  represents  the  uncertainty  (1  -  maximum  predicted 
probability),  and  the  x-axis  represents  the  34  test  samples  sorted  in  increasing  order  of  uncertainty. 
The  vertical  bar  represents  a  misclassified  sample. 

ALL  AML  3  class,  nbest=20,  p=1000 
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